Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Mapping of Sequence Reads to the Reference Genomes ◾ 55

2.2 READ MAPPING

Sequence alignment is one of the most important tasks in bioinformatics. It has been used

for decades before the recent revolution in genomics which erupted after the discovery

of the high-throughput sequencing technologies. It has been used in numerous bioinfor-

matics applications. The basic alignment algorithms are the global sequence alignment

using Needleman-Wunsch [1, 2] algorithm and the local sequence alignment using Smith–

Waterman (SW) [3] algorithm. Both these two legacy algorithms use dynamic programing,

which was invented by the mathematician Richard Bellman in early 1950s. The dynamic

programming in sequence alignment, which has nothing to do with computer program-

ming, is an algorithm that attempts to find the optimal alignment of two sequences by

dividing the whole alignment into smaller sub-alignments. The global sequence align-

ment is to align two sequences from end to end, and therefore, gaps can be introduced to

stretch the sequences to be equal in length. It is obvious that this type of alignment does

not work for aligning the massive number of short sequences to a genome sequence that

may span millions of bases. The second type is the local sequence alignment, which aligns

two sequences only to the regions of similarity. It is quite clear that this kind of sequence

alignment will work for aligning reads to a reference genome. The local sequence align-

ment in its crude form is exhaustive in a way that if you a have a single sequence and you

want to align it to multiple sequences, the algorithm will exhaust all possible alignments

and then report the results. In most cases, this approach is not practical for aligning mil-

lions of reads and it is computationally expensive with that massive number of genomics

sequences. The Basic Local Alignment Search Tool or BLAST [4] is heuristic version of the

SW local sequence alignment algorithm, which is more time-efficient than the exhaustive

approach because it searches only for the most significant patterns in the sequences and

performs sequence alignment accordingly. BLAST is fast but it may skip some alignments

because of its heuristic nature. Now after all, we can ask the question: Is BLAST suitable for

aligning millions of short-read sequences to a reference genome that may span millions of

bases? To provide a practical answer to this question, assume that BLAST could align a sin-

gle read to the reference genome in a second; if we have 20 million reads, BLAST will align

them to the reference genome in 20 million seconds (231.5 days), if we ignore the possibility

of the power or Internet outage or any other natural disaster. In conclusion, BLAST is not

efficient for read sequence alignment. Considering the large size of the reference genome

sequences and the large number of reads, one of the biggest challenges to perform search-

ing and locating the mapped region with high accuracy is indexing. Efficient indexing algo-

rithms are needed to organize the sequence of the reference genome and the short reads in

memory efficiently and to facilitate fast searching for patterns. Computer scientists play a

key role in developing data structures that allow computer programs to store and process

data effectively with less computational costs. Efficient data structures will help in both

indexing the sequence of a reference genome and fast searching of a read in an indexed

sequence with efficient memory usage. Several data structures for indexing were proposed

and implemented on software packages for read sequence alignment. The most com-

monly used data structures for indexing include suffix tree, suffix array, Burrows–Wheeler